Pandas

Introduction to pandas

pandas is a package built on top of numpy. The core object of this package is the DataFrame, which are multidimensional datasets with row and column labels. In addition to offering an interface for tagged data, pandas also has powerful data manipulation functions.

Installing pandas

Installation of pandas requires that you first have numpy installed. To install it, use the command:

# Installing using pip
pip install pandas

# Installing using conda
conda install pandas
In [144]:
# If you don't have pandas installed
!pip install pandas
Requirement already satisfied: pandas in /opt/conda/lib/python3.6/site-packages (0.23.4)
Requirement already satisfied: pytz>=2011k in /opt/conda/lib/python3.6/site-packages (from pandas) (2018.7)
Requirement already satisfied: python-dateutil>=2.5.0 in /opt/conda/lib/python3.6/site-packages (from pandas) (2.7.5)
Requirement already satisfied: numpy>=1.9.0 in /opt/conda/lib/python3.6/site-packages (from pandas) (1.13.3)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas) (1.12.0)

After installation, you can import it. The version can be checked as follows:

In [150]:
# Import libraries
import pandas as pd
import numpy as np
In [151]:
# Pandas Version
pd.__version__
Out[151]:
'0.23.4'

pandas objects

In [152]:
from IPython.display import Image
Image(filename='base_01_pandas_5_0.png', width=500)
# Resource: https://bookdata.readthedocs.io/en/latest/base/01_pandas.html
Out[152]:

Let's work with the famous Titanic dataset. Here are the available fields:

  • passengerId: A numerical ID assigned to each passenger
  • survived: Whether the passenger survived (1), or didn't (0)
  • pclass: The class the passenger was in
  • name: the name of the passenger
  • sex: The gender of the passenger - male or female
  • Age: The age of the passenger (includes fractional values)
  • sibsp: The number of siblings and spouses the passenger had on board
  • parch: The number of parents and children the passenger had on board
  • ticket: The ticket number of the passenger
  • fare: How much the passenger paid for the ticket
  • cabin: Which cabin the passenger was in
  • embarked: Where the passenger boarded the Titanic
In [153]:
# Load the file
titanic = pd.read_csv("Datasets/train.csv")
In [154]:
titanic.index
Out[154]:
RangeIndex(start=0, stop=891, step=1)
In [155]:
titanic.columns
Out[155]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
In [157]:
for i in titanic.columns:
    print(i)
PassengerId
Survived
Pclass
Name
Sex
Age
SibSp
Parch
Ticket
Fare
Cabin
Embarked
In [8]:
type(titanic.PassengerId)
Out[8]:
pandas.core.series.Series
In [9]:
type(titanic)
Out[9]:
pandas.core.frame.DataFrame
In [10]:
titanic.values
Out[10]:
array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ..., 
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)

Indexers: loc and iloc

Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular way to retrieve slices of the data.

The loc attribute allows indexing and segmenting that always refers to the explicit index. The iloc attribute allows indexing and segmenting that always refers to the implicit Python-style index. To illustrate the difference, let's look at the examples below. Notice that loc is inclusive of both ends of specified ranges, and that columns are specified by name. On the other hand, iloc is exclusive of the upper end of specified ranges, and columns are specified by numerical index.

In [160]:
titanic.loc[10:15,'PassengerId':'Age']
Out[160]:
PassengerId Survived Pclass Name Sex Age
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0
In [12]:
titanic.iloc[10:15,0:5]
Out[12]:
PassengerId Survived Pclass Name Sex
10 11 1 3 Sandstrom, Miss. Marguerite Rut female
11 12 1 1 Bonnell, Miss. Elizabeth female
12 13 0 3 Saundercock, Mr. William Henry male
13 14 0 3 Andersson, Mr. Anders Johan male
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female

Data selection

Now that we know a bit about how pandas indexes data, we can look at how to exrtract it. pandas has a number of ways for selecting subsets of a dataset:

In [13]:
# slicing by implicit index
titanic[10:15]
Out[13]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
In [161]:
# fancy indexing
titanic[10:15][['PassengerId','Age']]
Out[161]:
PassengerId Age
10 11 4.0
11 12 58.0
12 13 20.0
13 14 39.0
14 15 14.0
In [163]:
# masking
titanic[(titanic.Age > 18) & (titanic['Age'] < 50)][10:15]
Out[163]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
30 31 0 1 Uruchurtu, Don. Manuel E male 40.0 0 0 PC 17601 27.7208 NaN C
In [16]:
titanic['Age'][10:15]
Out[16]:
10     4.0
11    58.0
12    20.0
13    39.0
14    14.0
Name: Age, dtype: float64

Anatomy of a DataFrame

In [17]:
from IPython.display import Image
Image(filename='AnatomyDataFrame.png', width=500)
#Resource: https://cvw.cac.cornell.edu/PyDataSci1/arrays_dataframes
Out[17]:

concat()

pandas has a function, pd.concat(), which has a syntax similar to np.concatenate but contains different arguments. It can be used for a simple concatenation of Series or DataFrames:

In [164]:
df1 = titanic[(titanic.Age > 18) & (titanic['Age'] < 50)][10:15][['PassengerId','Survived']]
df1.reset_index(inplace = True)
#df1
Out[164]:
index PassengerId Survived
0 21 22 1
1 23 24 1
2 25 26 1
3 27 28 0
4 30 31 0
In [19]:
df2 = titanic.iloc[10:15,2:5]
df2.reset_index(inplace = True)
#df2
In [20]:
pd.concat([df1, df2],sort=True, axis=1)
Out[20]:
index PassengerId Survived index Pclass Name Sex
0 21 22 1 10 3 Sandstrom, Miss. Marguerite Rut female
1 23 24 1 11 1 Bonnell, Miss. Elizabeth female
2 25 26 1 12 3 Saundercock, Mr. William Henry male
3 27 28 0 13 3 Andersson, Mr. Anders Johan male
4 30 31 0 14 3 Vestrom, Miss. Hulda Amanda Adolfina female
In [21]:
pd.concat([df1, df2],ignore_index=True,sort=True, axis=0)
Out[21]:
Name PassengerId Pclass Sex Survived index
0 NaN 22.0 NaN NaN 1.0 21
1 NaN 24.0 NaN NaN 1.0 23
2 NaN 26.0 NaN NaN 1.0 25
3 NaN 28.0 NaN NaN 0.0 27
4 NaN 31.0 NaN NaN 0.0 30
5 Sandstrom, Miss. Marguerite Rut NaN 3.0 female NaN 10
6 Bonnell, Miss. Elizabeth NaN 1.0 female NaN 11
7 Saundercock, Mr. William Henry NaN 3.0 male NaN 12
8 Andersson, Mr. Anders Johan NaN 3.0 male NaN 13
9 Vestrom, Miss. Hulda Amanda Adolfina NaN 3.0 female NaN 14

merge() and join()

These functions allow data from different dataframes to be combined into one according to a "crossover" or "search" rule.

The merge() function is the default pandas function for merging data. It's basically pandas' counterpart of SQL JOINs, and requires specifying which columns of both DataFrames will be compared. merge() doesn't care at all about the indices defined on them.

On the other hand, join() is basically doing a merge but taking advantage of the indices of both DataFrames.

The following figure summarizes the different 4 types of JOINs: inner, outer, left and right.

MERGE

The merge() function is also available as a method in the DataFrame class. The basic syntax is:

new_joined_df = df.merge (another_df, left_on = "col_in_df", right_on = "col_in_another_df",
                          how="inner"|"left"|"right"|"outer")

The first argument (another_df), as well as left_on and right_on are required arguments. left_on specifies a column name in df whose values must match those in column another_df 'specified in right_on. The how argument is optional and defaults to inner.

In [165]:
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liasion'},
                         {'Name': 'James', 'Role': 'Grader'}])
staff_df = staff_df.set_index('Name')
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
student_df = student_df.set_index('Name')
In [166]:
pd.merge(staff_df, student_df,left_index=True, right_index=True)
Out[166]:
Role School
Name
Sally Course liasion Engineering
James Grader Business
In [167]:
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)
Out[167]:
Role School
Name
James Grader Business
Kelly Director of HR NaN
Mike NaN Law
Sally Course liasion Engineering
In [25]:
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)
Out[25]:
Role School
Name
Kelly Director of HR NaN
Sally Course liasion Engineering
James Grader Business
In [26]:
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)
Out[26]:
Role School
Name
James Grader Business
Mike NaN Law
Sally Course liasion Engineering
In [27]:
staff_df = staff_df.reset_index()
student_df = student_df.reset_index()
In [28]:
pd.merge(staff_df, student_df, how="left", left_on="Name", right_on="Name")
Out[28]:
Name Role School
0 Kelly Director of HR NaN
1 Sally Course liasion Engineering
2 James Grader Business
In [29]:
pd.merge(staff_df, student_df, how="left",on="Name")
Out[29]:
Name Role School
0 Kelly Director of HR NaN
1 Sally Course liasion Engineering
2 James Grader Business
In [30]:
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR'},
                         {'First Name': 'Sally', 'Last Name': 'Brooks', 'Role': 'Course liasion'},
                         {'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader'}])
In [31]:
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name': 'Brooks', 'School': 'Engineering'}])
In [32]:
pd.merge(staff_df, student_df, how='inner', left_on=['First Name','Last Name'], right_on=['First Name','Last Name'])
Out[32]:
First Name Last Name Role School
0 Sally Brooks Course liasion Engineering

Aggregation and grouping

An essential part of analyzing lots of data is creating efficient summaries of it. Computational aggregations, such as sum(), mean(), median(), min() and max(), allow us to use relatively few numbers to give us an idea of the nature of a potentially large dataset.

In this section, we will explore aggregations in pandas, from simple operations similar to what we have seen in numpy arrays, to more sophisticated operations based on the concept of groupby().

In [33]:
titanic['Age'].mean()
Out[33]:
29.69911764705882
In [168]:
titanic.groupby('Sex').size()
Out[168]:
Sex
female    314
male      577
dtype: int64
In [169]:
titanic.groupby('Sex').count()
Out[169]:
PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Embarked
Sex
female 314 314 314 314 261 314 314 314 314 97 312
male 577 577 577 577 453 577 577 577 577 107 577
In [171]:
prueba = titanic.groupby('Sex').agg({'Age':['min','mean','max']})
In [172]:
prueba.loc()
Out[172]:
<pandas.core.indexing._LocIndexer at 0x7f4707711ae8>

crosstab()

One of the useful functions in pandas is crosstab(). This allows us to segregate our data into buckets using the classes of a few categorical variables, then construct a table showing the number of data points in each bucket:

In [36]:
pd.crosstab(titanic['Sex'], titanic['Pclass'])
Out[36]:
Pclass 1 2 3
Sex
female 94 76 144
male 122 108 347

Pivot tables

These are just like the pivot tables in Excel. Here, we use it to look at the average age of passengers, segregated by class and gender:

In [37]:
titanic.pivot_table(values='Age',index='Sex',columns='Pclass', aggfunc=np.mean)
Out[37]:
Pclass 1 2 3
Sex
female 34.611765 28.722973 21.750000
male 41.281386 30.740707 26.507589

A little bit about dates in pandas

pandas has some functionaliy for working with dates as well:

In [183]:
import seaborn as sns

# Load dataset
flights = sns.load_dataset('flights')
flights.head(2)
Out[183]:
year month passengers
0 1949 January 112
1 1949 February 118
In [184]:
flights['date'] = flights['year'].astype(str)+ ' ' + flights['month'].astype(str)

flights.head(2)
Out[184]:
year month passengers date
0 1949 January 112 1949 January
1 1949 February 118 1949 February
In [185]:
flights['date'] = pd.to_datetime(flights['date'])
flights.head(2)
Out[185]:
year month passengers date
0 1949 January 112 1949-01-01
1 1949 February 118 1949-02-01
In [188]:
flights.set_index('date',inplace=True)
In [190]:
flights.groupby(pd.Grouper(freq='Q')).agg({'passengers':'mean'}).plot()
Out[190]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f47072f8470>
In [192]:
flights.resample('y').mean()
Out[192]:
year passengers
date
1949-12-31 1949.0 126.666667
1950-12-31 1950.0 139.666667
1951-12-31 1951.0 170.166667
1952-12-31 1952.0 197.000000
1953-12-31 1953.0 225.000000
1954-12-31 1954.0 238.916667
1955-12-31 1955.0 284.000000
1956-12-31 1956.0 328.250000
1957-12-31 1957.0 368.416667
1958-12-31 1958.0 381.000000
1959-12-31 1959.0 428.333333
1960-12-31 1960.0 476.166667
In [193]:
flights.index.weekday_name
Out[193]:
Index(['Saturday', 'Tuesday', 'Tuesday', 'Friday', 'Sunday', 'Wednesday',
       'Friday', 'Monday', 'Thursday', 'Saturday',
       ...
       'Tuesday', 'Friday', 'Sunday', 'Wednesday', 'Friday', 'Monday',
       'Thursday', 'Saturday', 'Tuesday', 'Thursday'],
      dtype='object', name='date', length=144)
In [196]:
flights.resample('y').mean()['passengers'].diff(2)
Out[196]:
date
1949-12-31          NaN
1950-12-31          NaN
1951-12-31    43.500000
1952-12-31    57.333333
1953-12-31    54.833333
1954-12-31    41.916667
1955-12-31    59.000000
1956-12-31    89.333333
1957-12-31    84.416667
1958-12-31    52.750000
1959-12-31    59.916667
1960-12-31    95.166667
Freq: A-DEC, Name: passengers, dtype: float64
In [46]:
flights['1950-12-31':].head()
Out[46]:
year month passengers
date
1951-01-01 1951 January 145
1951-02-01 1951 February 150
1951-03-01 1951 March 178
1951-04-01 1951 April 163
1951-05-01 1951 May 172
In [197]:
flights.reset_index(inplace=True)
flights['weekday_name'] = flights['date'].dt.weekday_name
In [48]:
flights.groupby('weekday_name').size()
Out[48]:
weekday_name
Friday       21
Monday       19
Saturday     20
Sunday       21
Thursday     21
Tuesday      23
Wednesday    19
dtype: int64
In [49]:
flights['weekday_name'] = pd.Categorical(flights['weekday_name'], categories=
    ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'],
    ordered=True)
In [50]:
flights.groupby('weekday_name').size()
Out[50]:
weekday_name
Monday       19
Tuesday      23
Wednesday    19
Thursday     21
Friday       21
Saturday     20
Sunday       21
dtype: int64

Exploring the data

As we go through the dataset, knowing some functions that pandas provide us to make our lives easier is crucial. Here, we want to identify the types of data in our dataset, if we have nulls in our dataset, etc. Some functions that can help us for this include:

df.head()
df.tail()
df.info()
df.shape
df.columns
df.describe()
df.value_counts()
df.unique()
df.nunique()
In [51]:
# See first two rows
titanic.head(2)
Out[51]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
In [52]:
# See last two rows
titanic.tail(2)
Out[52]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q
In [53]:
# How many rows and columns in dataset
titanic.shape
Out[53]:
(891, 12)
In [54]:
# Data types and amount of null values in dataset
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
In [55]:
# Another way to see null values per column
titanic.isnull().sum()
Out[55]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
In [56]:
# Another way to see data types
titanic.dtypes
Out[56]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
In [57]:
# Information about numeric columns
titanic.describe()
Out[57]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
In [58]:
# Information about categorical columns
titanic.describe(include = ['O'])
Out[58]:
Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Todoroff, Mr. Lalio male 347082 G6 S
freq 1 577 7 4 644
In [59]:
# Unique values per column excluding null
titanic.nunique()
Out[59]:
PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64
In [60]:
# Unique values per column including null
titanic.nunique(dropna=False)
Out[60]:
PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             89
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          148
Embarked         4
dtype: int64
In [61]:
# Counting
titanic.count()
Out[61]:
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64
In [62]:
# Most frequent value in the series
titanic['Embarked'].mode()

# Unique values of the column 'Embarked'
titanic['Embarked'].unique()
Out[62]:
array(['S', 'C', 'Q', nan], dtype=object)
In [63]:
# How many records do we have by category of the column 'Embarked' without null values
titanic['Embarked'].value_counts()
Out[63]:
S    644
C    168
Q     77
Name: Embarked, dtype: int64
In [64]:
# How many records do we have by category of the column 'Embarked' with null values
titanic['Embarked'].value_counts(dropna=False)
Out[64]:
S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64
In [65]:
# Names of the columns in the dataset
titanic.columns
Out[65]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
In [66]:
titanic.columns = ['Passenger_Id', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
titanic = titanic.rename(columns={'Passenger_Id':'PassengerId'})
titanic.head()
Out[66]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [67]:
?titanic.rename
In [68]:
titanic[(titanic['Embarked']=='S') & (titanic['Sex']=='female')][['Name','Age']].head()
Out[68]:
Name Age
2 Heikkinen, Miss. Laina 26.0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0
8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 27.0
10 Sandstrom, Miss. Marguerite Rut 4.0
11 Bonnell, Miss. Elizabeth 58.0

Exercise 1:

Using the sex, Age, and fare columns, answer the following questions:

  1. How many Men and Women do we have in our dataset?
  2. What is the most frequent age in our dataset?
  3. What is the most common age for women?
  4. What is the most common age for men?
  5. What is the most common ticket price?
  6. Is the price of the most common ticket the same for men and women?
In [69]:
# Your code ...

This is what you should get for #1:

male      577
female    314
Name: Sex, dtype: int64

Double-click Here to see the solution.

In [70]:
# Your code ...

This is what you should get for #2:

0    24.0
dtype: float64

Double-click Here to see the solution.

In [71]:
# Your code ...

This is what you should get for #3:

0    24.0
dtype: float64

Double-click Here to see the solution.

In [72]:
# Your code ...

This is what you should get for #4:

0    19.0
1    25.0
2    28.0
dtype: float64

Double-click Here to see the solution.

In [73]:
# Your code ...

This is what you should get for #5:

0    8.05
dtype: float64

Double-click Here to see the solution.

In [74]:
# Your code ...

This is what you should get for #6:

Common ticket price for women :7.75
Common ticket price for men :8.05

Double-click Here to see the solution.

Pre-processing our data

Now that we have an idea of what our dataset consists of, let's transform it so that we can display phase. The types of activities we may engage in during pre-processing include:

  1. Deleting columns
  2. Enriching (or Transforming) a data set, adding newly calculated columns in the indices
  3. Filtering a subset of the rows or columns of a dataset according to some criteria
  4. Indexing a dataset
  5. Aggregating data
  6. Sorting the rows of a data set according to some criteria
  7. Merging the data
  8. Pivoting so that data that was originally laid out vertically is laid out horizontally (increasing the number of columns) or vice versa (increasing the number of rows)

among others.

In [75]:
titanic.columns
Out[75]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
In [76]:
# Delete columns PassengerId, Name, Cabin, we are not going to delete the ticket column yet.
titanic.drop(['PassengerId','Name','Cabin'],axis=1, inplace=True)
titanic.shape
Out[76]:
(891, 9)
In [77]:
# Null validation again
titanic.isnull().sum()
Out[77]:
Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Embarked      2
dtype: int64
In [78]:
# In the first 10 records we have a null age field, we are going to have it as a reference for what we are going to do
titanic.head(10)
Out[78]:
Survived Pclass Sex Age SibSp Parch Ticket Fare Embarked
0 0 3 male 22.0 1 0 A/5 21171 7.2500 S
1 1 1 female 38.0 1 0 PC 17599 71.2833 C
2 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 S
3 1 1 female 35.0 1 0 113803 53.1000 S
4 0 3 male 35.0 0 0 373450 8.0500 S
5 0 3 male NaN 0 0 330877 8.4583 Q
6 0 1 male 54.0 0 0 17463 51.8625 S
7 0 3 male 2.0 3 1 349909 21.0750 S
8 1 3 female 27.0 0 2 347742 11.1333 S
9 1 2 female 14.0 1 0 237736 30.0708 C

Null values

Depending on the problem, the treatment of nulls can be handled in different ways:

  • Keep them
  • Delete them
  • Leave a fixed value that differentiates them
  • Replace them with a summary statistic (mean, median, mode)
  • Forward or backward fill
  • Interpolate

You will learn about when to do each in a later case.

In [79]:
# Delete
titanic.dropna(inplace=True)
titanic.shape
Out[79]:
(712, 9)
In [80]:
# Null validation
titanic.isnull().sum()
Out[80]:
Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Ticket      0
Fare        0
Embarked    0
dtype: int64
In [81]:
# In the first 10 records we have a null age field, we are going to have it as a reference for what we are going to do
titanic.head(10)
Out[81]:
Survived Pclass Sex Age SibSp Parch Ticket Fare Embarked
0 0 3 male 22.0 1 0 A/5 21171 7.2500 S
1 1 1 female 38.0 1 0 PC 17599 71.2833 C
2 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 S
3 1 1 female 35.0 1 0 113803 53.1000 S
4 0 3 male 35.0 0 0 373450 8.0500 S
6 0 1 male 54.0 0 0 17463 51.8625 S
7 0 3 male 2.0 3 1 349909 21.0750 S
8 1 3 female 27.0 0 2 347742 11.1333 S
9 1 2 female 14.0 1 0 237736 30.0708 C
10 1 3 female 4.0 1 1 PP 9549 16.7000 S
In [82]:
titanic.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 9 columns):
Survived    712 non-null int64
Pclass      712 non-null int64
Sex         712 non-null object
Age         712 non-null float64
SibSp       712 non-null int64
Parch       712 non-null int64
Ticket      712 non-null object
Fare        712 non-null float64
Embarked    712 non-null object
dtypes: float64(2), int64(4), object(3)
memory usage: 55.6+ KB
In [83]:
titanic.reset_index(drop=True,inplace=True)
titanic.head(10)
Out[83]:
Survived Pclass Sex Age SibSp Parch Ticket Fare Embarked
0 0 3 male 22.0 1 0 A/5 21171 7.2500 S
1 1 1 female 38.0 1 0 PC 17599 71.2833 C
2 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 S
3 1 1 female 35.0 1 0 113803 53.1000 S
4 0 3 male 35.0 0 0 373450 8.0500 S
5 0 1 male 54.0 0 0 17463 51.8625 S
6 0 3 male 2.0 3 1 349909 21.0750 S
7 1 3 female 27.0 0 2 347742 11.1333 S
8 1 2 female 14.0 1 0 237736 30.0708 C
9 1 3 female 4.0 1 1 PP 9549 16.7000 S
In [84]:
# Leave a fixed value that differentiates them
titanic = pd.read_csv("Datasets/train.csv")
titanic.drop(['PassengerId','Name','Cabin'],axis=1, inplace=True)
In [85]:
titanic.shape
Out[85]:
(891, 9)
In [86]:
titanic.fillna(-1).head(10)
Out[86]:
Survived Pclass Sex Age SibSp Parch Ticket Fare Embarked
0 0 3 male 22.0 1 0 A/5 21171 7.2500 S
1 1 1 female 38.0 1 0 PC 17599 71.2833 C
2 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 S
3 1 1 female 35.0 1 0 113803 53.1000 S
4 0 3 male 35.0 0 0 373450 8.0500 S
5 0 3 male -1.0 0 0 330877 8.4583 Q
6 0 1 male 54.0 0 0 17463 51.8625 S
7 0 3 male 2.0 3 1 349909 21.0750 S
8 1 3 female 27.0 0 2 347742 11.1333 S
9 1 2 female 14.0 1 0 237736 30.0708 C
In [87]:
titanic['Age'].fillna(titanic['Age'].mean()).head(10)
Out[87]:
0    22.000000
1    38.000000
2    26.000000
3    35.000000
4    35.000000
5    29.699118
6    54.000000
7     2.000000
8    27.000000
9    14.000000
Name: Age, dtype: float64
In [88]:
titanic['Age'].fillna(titanic['Age'].median()).head(10)
Out[88]:
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5    28.0
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64
In [89]:
titanic['Age'].fillna(titanic['Age'].mode()[0]).head(10)
Out[89]:
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5    24.0
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64
In [90]:
titanic['Age'].fillna(method='backfill').head(10)
Out[90]:
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5    54.0
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64
In [91]:
titanic['Age'].fillna(method='ffill').head(10)
Out[91]:
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5    35.0
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64
In [92]:
titanic['Age'].interpolate(method='polynomial', order=2).head(10)
Out[92]:
0    22.000000
1    38.000000
2    26.000000
3    35.000000
4    35.000000
5    56.320363
6    54.000000
7     2.000000
8    27.000000
9    14.000000
Name: Age, dtype: float64

For our exercise we will make the decision to drop the nulls.

In [93]:
# Drop nulls
titanic.dropna(inplace=True)
titanic.shape
Out[93]:
(712, 9)

Column renaming

To be practical, we normally modify the names of the columns in order to remove blank spaces and special characters, and to convert uppercase to lowercase:

In [199]:
# Rename columns to lowercase
titanic.rename(columns = lambda col: col.lower(),inplace=True)
In [203]:
titanic.columns
Out[203]:
Index(['passengerid', 'survived', 'pclass', 'name', 'sex', 'age', 'sibsp',
       'parch', 'ticket', 'fare', 'cabin', 'embarked'],
      dtype='object')
In [204]:
# Validate dataset
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
passengerid    891 non-null int64
survived       891 non-null int64
pclass         891 non-null int64
name           891 non-null object
sex            891 non-null object
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
ticket         891 non-null object
fare           891 non-null float64
cabin          204 non-null object
embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
In [96]:
titanic.shape
Out[96]:
(712, 9)

Indexing

In [97]:
# After eliminating the nulls, the indexes must be reset, as you can see they still appear from 0 to 890 
# even though we have 712 rows.
titanic.reset_index(drop=True,inplace=True)
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 9 columns):
survived    712 non-null int64
pclass      712 non-null int64
sex         712 non-null object
age         712 non-null float64
sibsp       712 non-null int64
parch       712 non-null int64
ticket      712 non-null object
fare        712 non-null float64
embarked    712 non-null object
dtypes: float64(2), int64(4), object(3)
memory usage: 50.1+ KB

Grouping

In [98]:
titanic.groupby(['sex','survived']).size()
Out[98]:
sex     survived
female  0            64
        1           195
male    0           360
        1            93
dtype: int64
In [99]:
titanic.groupby(['sex','survived']).agg({'fare':'count','age':'mean'}).reset_index()
Out[99]:
sex survived fare age
0 female 0 64 25.046875
1 female 1 195 28.630769
2 male 0 360 31.618056
3 male 1 93 27.276022
In [100]:
# Validate duplicate tickets
df_tickets = titanic.groupby('ticket').size().reset_index().rename(columns={0:'count'})
df_tickets.sort_values(by='count',ascending=False)
Out[100]:
ticket count
273 347082 7
445 CA 2144 6
207 3101295 6
277 347088 6
490 S.O.C. 14879 5
376 382652 5
77 19950 4
65 1601 4
28 113781 4
451 LINE 4
159 2666 4
25 113760 4
320 349909 4
269 347077 4
537 W./C. 6608 4
341 35273 3
441 C.A. 34651 3
345 363291 3
195 29106 3
67 17421 3
284 347742 3
1 110413 3
436 C.A. 31921 3
59 13502 3
248 345773 3
93 230080 3
0 110152 3
463 PC 17582 3
484 PC 17760 3
462 PC 17572 3
... ... ...
213 31418 1
212 312992 1
235 34218 1
237 343120 1
262 347067 1
238 343275 1
261 347064 1
260 347063 1
259 347062 1
258 347061 1
257 347060 1
56 13049 1
255 3460 1
254 345783 1
253 345781 1
252 345780 1
251 345779 1
250 345778 1
249 345774 1
57 13213 1
247 345770 1
246 345769 1
245 345767 1
244 345765 1
58 13214 1
242 345763 1
241 345572 1
240 345364 1
239 343276 1
270 347078 1

541 rows × 2 columns

In [101]:
titanic[titanic['ticket']=='347082']
Out[101]:
survived pclass sex age sibsp parch ticket fare embarked
12 0 3 male 39.0 1 5 347082 31.275 S
93 0 3 female 2.0 4 2 347082 31.275 S
428 0 3 female 9.0 4 2 347082 31.275 S
429 0 3 female 11.0 4 2 347082 31.275 S
481 0 3 female 39.0 1 5 347082 31.275 S
649 0 3 female 6.0 4 2 347082 31.275 S
676 0 3 male 4.0 4 2 347082 31.275 S
In [102]:
titanic[titanic['ticket']=='110413']
Out[102]:
survived pclass sex age sibsp parch ticket fare embarked
210 0 1 male 52.0 1 1 110413 79.65 S
442 1 1 female 39.0 1 1 110413 79.65 S
462 1 1 female 18.0 0 2 110413 79.65 S

Ordering

In [103]:
# Sort the number of records by tickets
df_tickets.sort_values(by=["count"],ascending=False,inplace=True)
df_tickets.head()
Out[103]:
ticket count
273 347082 7
445 CA 2144 6
207 3101295 6
277 347088 6
490 S.O.C. 14879 5
In [104]:
# Let's validate a ticket
titanic[titanic['ticket']=='347082']
Out[104]:
survived pclass sex age sibsp parch ticket fare embarked
12 0 3 male 39.0 1 5 347082 31.275 S
93 0 3 female 2.0 4 2 347082 31.275 S
428 0 3 female 9.0 4 2 347082 31.275 S
429 0 3 female 11.0 4 2 347082 31.275 S
481 0 3 female 39.0 1 5 347082 31.275 S
649 0 3 female 6.0 4 2 347082 31.275 S
676 0 3 male 4.0 4 2 347082 31.275 S

Merging datasets

In [105]:
# First we will add the column 'count' from df_tickets to our titanic dataset
titanic = titanic.merge(df_tickets,left_on='ticket',right_on='ticket',how='left')
titanic.shape
Out[105]:
(712, 10)
In [106]:
titanic.head()
Out[106]:
survived pclass sex age sibsp parch ticket fare embarked count
0 0 3 male 22.0 1 0 A/5 21171 7.2500 S 1
1 1 1 female 38.0 1 0 PC 17599 71.2833 C 1
2 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 S 1
3 1 1 female 35.0 1 0 113803 53.1000 S 2
4 0 3 male 35.0 0 0 373450 8.0500 S 1

Enrich dataset

We will enrich this dataset by creating a new column family, which will be "1" if the trip was taken with others, and "0" if they traveled alone:

In [107]:
titanic['family'] = [1 if count>1 else 0 for count in titanic['count']]
In [108]:
titanic.head()
Out[108]:
survived pclass sex age sibsp parch ticket fare embarked count family
0 0 3 male 22.0 1 0 A/5 21171 7.2500 S 1 0
1 1 1 female 38.0 1 0 PC 17599 71.2833 C 1 0
2 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 0
3 1 1 female 35.0 1 0 113803 53.1000 S 2 1
4 0 3 male 35.0 0 0 373450 8.0500 S 1 0

Cleaning text

In [109]:
# Remember the functions? we are going to create a function that will help us to clean text, in this case the tickets field
def clean_text(string):
    import re
    regex = '[^A-Za-z0-9]+'
    return re.sub(regex, '', string)
In [110]:
titanic['ticket'] = titanic['ticket'].apply(clean_text)
titanic.head()
Out[110]:
survived pclass sex age sibsp parch ticket fare embarked count family
0 0 3 male 22.0 1 0 A521171 7.2500 S 1 0
1 1 1 female 38.0 1 0 PC17599 71.2833 C 1 0
2 1 3 female 26.0 0 0 STONO23101282 7.9250 S 1 0
3 1 1 female 35.0 1 0 113803 53.1000 S 2 1
4 0 3 male 35.0 0 0 373450 8.0500 S 1 0
In [111]:
# To continue we will delete the ticket column
titanic.drop('ticket',axis=1,inplace=True)
titanic.head()
Out[111]:
survived pclass sex age sibsp parch fare embarked count family
0 0 3 male 22.0 1 0 7.2500 S 1 0
1 1 1 female 38.0 1 0 71.2833 C 1 0
2 1 3 female 26.0 0 0 7.9250 S 1 0
3 1 1 female 35.0 1 0 53.1000 S 2 1
4 0 3 male 35.0 0 0 8.0500 S 1 0

Encoding categorical variables

Sometimes, it is helpful to create a column for each possible category of one of our categorical variables. For this, we use a pandas method called get_dummies():

In [112]:
titanic['sexDummies'] = titanic['sex']
titanic = pd.get_dummies(titanic, columns = ['sexDummies'], prefix = ['D'])
titanic.head()
Out[112]:
survived pclass sex age sibsp parch fare embarked count family D_female D_male
0 0 3 male 22.0 1 0 7.2500 S 1 0 0 1
1 1 1 female 38.0 1 0 71.2833 C 1 0 1 0
2 1 3 female 26.0 0 0 7.9250 S 1 0 1 0
3 1 1 female 35.0 1 0 53.1000 S 2 1 1 0
4 0 3 male 35.0 0 0 8.0500 S 1 0 0 1

Exercise 2:

  1. Group the data set by gender (sex) and calculate the average age (age).
  2. Group the data set by the family column (family) and count the number of records per group.
  3. Of the families, how many survived and how many did not?
In [113]:
# Your code ...

This is what you should get for #1:

_ sex age
0 female 27.915709
1 male 30.726645

Double-click Here to see the solution.

In [114]:
# Your code ..

This is what you should get for #2:

_ family count
0 0 547
1 1 344

Double-click Here to see the solution.

In [115]:
# Your code ...

This is what you should get for #3:

_ family survived count
0 0 0 384
1 0 1 163
2 1 0 165
3 1 1 179

Double-click Here to see the solution.

Data visualizations

To conduct descriptive analysis of our data, we make use of various graphs to gather more information that we will use later on in the data science process when building our models. It is essential to have knowledge of the Python libraries that allow us to generate these visualizations, such as matplotlib and seaborn. pandas also has a few visualization methods.

In [116]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

titanic.head()
Out[116]:
survived pclass sex age sibsp parch fare embarked count family D_female D_male
0 0 3 male 22.0 1 0 7.2500 S 1 0 0 1
1 1 1 female 38.0 1 0 71.2833 C 1 0 1 0
2 1 3 female 26.0 0 0 7.9250 S 1 0 1 0
3 1 1 female 35.0 1 0 53.1000 S 2 1 1 0
4 0 3 male 35.0 0 0 8.0500 S 1 0 0 1

 Line chart

In [117]:
# Let's start with a simple graph, by not placing X it is assumed that x is each of the records.
plt.figure(figsize=(6,3))
plt.plot(titanic['age'])
plt.title('Titanic Graficando la Edad')
plt.xlabel('id')
plt.ylabel('age')
plt.show()
In [118]:
?plt

Histogram

In [119]:
plt.figure(figsize=(6,4))
sns.distplot(titanic['age'])
plt.title('Titanic Histograma de la Edad')
plt.show()
In [120]:
# If we only want the density function
plt.figure(figsize=(6,3))
sns.kdeplot(titanic['age'],shade=True)
plt.title('Titanic Densidad de la Edad')
plt.xlabel('age');
In [121]:
g = sns.FacetGrid(titanic, col="survived", height=3)
g.map(sns.kdeplot, 'age', shade=True)
g.despine(left=True,bottom=True);
In [122]:
g = sns.FacetGrid(titanic, col='survived', row='pclass', hue='sex', height=3)
g.map(sns.kdeplot, 'age', shade=True).add_legend()
g.despine(left=True, bottom=True)
plt.show()

Boxplot

In [123]:
plt.figure(figsize=(6,3))
sns.boxplot(titanic['age'])#,orient='v')
plt.title('Titanic Age Boxplot')
plt.show()
In [124]:
plt.figure(figsize=(6,4))
sns.boxplot(x='survived',y='age',data=titanic)
plt.title('Titanic Boxplot of survivors and age')
plt.show()
In [125]:
plt.figure(figsize=(10,4))
sns.boxplot(x='sex',y='age',hue='survived',data=titanic)
plt.title('Titanic Boxplot of survivors, gender and age')
plt.show()

Bar chart

In [126]:
plt.figure(figsize=(6, 4))
sns.countplot('survived',data=titanic)
plt.show()
In [127]:
plt.figure(figsize=(6, 4))
sns.barplot(y='age',x='pclass',data=titanic)
plt.show()
In [128]:
plt.figure(figsize=(6, 4))
sns.barplot(y='survived',x='sex',hue='pclass',data=titanic)
plt.show()
In [129]:
plt.figure(figsize=(6, 4))
sns.barplot(y='pclass',x='sex',hue='family',data=titanic)
plt.show()

Heatmap

In [130]:
titanic.drop(['survived','family','D_female','D_male','count'],axis=1).corr()
Out[130]:
pclass age sibsp parch fare
pclass 1.000000 -0.365902 0.065187 0.023666 -0.552893
age -0.365902 1.000000 -0.307351 -0.187896 0.093143
sibsp 0.065187 -0.307351 1.000000 0.383338 0.139860
parch 0.023666 -0.187896 0.383338 1.000000 0.206624
fare -0.552893 0.093143 0.139860 0.206624 1.000000
In [131]:
plt.figure()
sns.heatmap(titanic.drop(['survived','family','D_female','D_male','count'],axis=1).corr(),annot=True, linewidth=0.5,fmt='.1f')
plt.show()

 Scatterplot

In [132]:
plt.figure()
sns.scatterplot(x='age', y='fare', hue='sex', style='survived', data=titanic)
plt.show()
In [133]:
plt.figure()
sns.scatterplot(x='pclass', y='age', hue='sex', style='survived',size='count', data=titanic)
plt.legend(loc='upper left')
plt.show()
In [134]:
plt.figure()
sns.scatterplot(x='age', y='fare',data=titanic)
plt.show()
In [135]:
type(titanic.columns)
Out[135]:
pandas.core.indexes.base.Index
In [136]:
for  col in titanic.columns:
    print(col)
survived
pclass
sex
age
sibsp
parch
fare
embarked
count
family
D_female
D_male
In [137]:
!pip install plotly
Collecting plotly
  Downloading https://files.pythonhosted.org/packages/bf/5f/47ab0d9d843c5be0f5c5bd891736a4c84fa45c3b0a0ddb6b6df7c098c66f/plotly-4.9.0-py2.py3-none-any.whl (12.9MB)
    100% |████████████████████████████████| 12.9MB 392kB/s ta 0:00:01  1% |▌                               | 194kB 699kB/s eta 0:00:19    5% |█▋                              | 665kB 550kB/s eta 0:00:23    13% |████▍                           | 1.8MB 664kB/s eta 0:00:17    15% |████▉                           | 2.0MB 1.0MB/s eta 0:00:11    17% |█████▊                          | 2.3MB 655kB/s eta 0:00:17    25% |████████                        | 3.2MB 1.4MB/s eta 0:00:08    34% |███████████                     | 4.4MB 1.0MB/s eta 0:00:09    52% |█████████████████               | 6.8MB 512kB/s eta 0:00:12    53% |█████████████████               | 6.8MB 600kB/s eta 0:00:11    55% |█████████████████▊              | 7.1MB 769kB/s eta 0:00:08    65% |█████████████████████           | 8.5MB 187kB/s eta 0:00:24    67% |█████████████████████▋          | 8.7MB 707kB/s eta 0:00:06    69% |██████████████████████▎         | 8.9MB 686kB/s eta 0:00:06    71% |██████████████████████▉         | 9.2MB 1.0MB/s eta 0:00:04    72% |███████████████████████         | 9.3MB 945kB/s eta 0:00:04    78% |█████████████████████████▏      | 10.1MB 575kB/s eta 0:00:05    89% |████████████████████████████▊   | 11.6MB 361kB/s eta 0:00:04    94% |██████████████████████████████▏ | 12.1MB 1.2MB/s eta 0:00:01    97% |███████████████████████████████ | 12.5MB 1.1MB/s eta 0:00:01
Requirement already satisfied: six in /opt/conda/lib/python3.6/site-packages (from plotly) (1.12.0)
Collecting retrying>=1.3.3 (from plotly)
  Downloading https://files.pythonhosted.org/packages/44/ef/beae4b4ef80902f22e3af073397f079c96969c69b2c7d52a57ea9ae61c9d/retrying-1.3.3.tar.gz
Building wheels for collected packages: retrying
  Running setup.py bdist_wheel for retrying ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/d7/a9/33/acc7b709e2a35caa7d4cae442f6fe6fbf2c43f80823d46460c
Successfully built retrying
Installing collected packages: retrying, plotly
Successfully installed plotly-4.9.0 retrying-1.3.3
In [138]:
import plotly.express as px
fig = px.scatter(titanic,x='age',y='fare', color='sex',size='count')
fig.show()